To start off with, let’s load data and libs:
library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages -----------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
library(CKMRsim)
library(stringr)
#meta <- readRDS("../data/processed/meta-data-tibble.rds") # not worrying about meta data right now...
genos <- readRDS("../data/processed/called_genos_na_explicit.rds") #%>%
#filter(NMFS_DNA_ID %in% meta$NMFS_DNA_ID) # drop those we don't have meta data for
samples <- readRDS("../data/processed/sample-sheet-tibble.rds") #%>%
#filter(NMFS_DNA_ID %in% meta$NMFS_DNA_ID)
I’m not sure if there are any of these, but best to leave it in here… particularly for the re-genotyped fish from Brittany’s OSU samples.
Now, here is a harder operation: if an individual is multiply-genotyped, take the genotype with the highest total read depth.
# slow-ish function to get the total read depth column
tdepth <- function(a, d) {
  if (any(is.na(a))) {
    return(NA)
  }
  if (a[1] == a[2]) {
    return(d[1])          # homozygote: the depth is reported once
  } else {
    return(d[1] + d[2])   # heterozygote: sum the two allele depths
  }
}
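A quick sanity check of `tdepth()` on made-up inputs (the alleles and depths here are invented for illustration; the function is repeated so the chunk runs standalone):

```r
# toy checks for tdepth()
tdepth <- function(a, d) {
  if (any(is.na(a))) return(NA)
  if (a[1] == a[2]) return(d[1]) else return(d[1] + d[2])
}

tdepth(c("A", "A"), c(12, 12))   # homozygote: 12
tdepth(c("A", "G"), c(7, 5))     # heterozygote: 12
tdepth(c(NA, NA), c(0, 0))       # missing genotype: NA
```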
# this takes the highest read-depth instance of each multiply-genotyped individual.
geno_one_each <- genos %>%
  group_by(NMFS_DNA_ID, locus, gtseq_run, id) %>%
  mutate(total_depth = tdepth(allele, depth)) %>%
  ungroup() %>%
  arrange(NMFS_DNA_ID, locus, desc(total_depth), gtseq_run, id, depth) %>%  # deepest genotype first
  group_by(NMFS_DNA_ID, locus) %>%
  mutate(rank = 1:n()) %>%
  ungroup() %>%
  filter(rank <= 2)  # keep the two gene copies of the top-ranked genotype
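A minimal sketch of the ranking logic on a made-up tibble (one fish typed at one locus in two gtseq runs; IDs and depths are invented). Sorting so the deepest genotype comes first and keeping the first two rows per fish–locus retains both gene copies of the highest-depth run:

```r
library(dplyr)

toy <- tibble(
  NMFS_DNA_ID = "fish_01", locus = "loc_1",
  gtseq_run   = c(1, 1, 2, 2),       # two genotyping runs
  gene_copy   = c(1, 2, 1, 2),
  total_depth = c(20, 20, 55, 55)    # run 2 has the deeper genotype
)

kept <- toy %>%
  arrange(NMFS_DNA_ID, locus, desc(total_depth)) %>%
  group_by(NMFS_DNA_ID, locus) %>%
  mutate(rank = 1:n()) %>%
  ungroup() %>%
  filter(rank <= 2)   # the two rows from run 2
kept
```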
# read in a list of the 6 loci
to_remove <- read_csv("../data/loci_to_remove.csv")
Parsed with column specification:
cols(
locus = col_character()
)
# only keep the loci that are not those 6
keepers <- geno_one_each %>%
  anti_join(., to_remove, by = "locus")
# that should leave 90 loci
Now, toss out any individual with fewer than 75 non-missing loci
no_hi_missers <- keepers %>%
  group_by(NMFS_DNA_ID) %>%
  filter(sum(!is.na(allele)) >= (75 * 2))  # 75 loci x 2 gene copies
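Because each locus contributes two gene-copy rows, the cutoff counts non-missing gene copies: a fish needs at least 150 of its 180 rows called. A contrived two-fish sketch of the same grouped filter (IDs and alleles invented, with a cutoff of 4 standing in for 150):

```r
library(dplyr)

toy <- tibble(
  NMFS_DNA_ID = rep(c("fish_ok", "fish_bad"), each = 4),
  allele      = c("A", "A", "G", "G",   # fish_ok: all 4 gene copies called
                  "A", NA,  NA,  NA)    # fish_bad: mostly missing
)

enough <- toy %>%
  group_by(NMFS_DNA_ID) %>%
  filter(sum(!is.na(allele)) >= 4)      # stand-in for the >= 75*2 cutoff
enough                                  # only fish_ok survives
```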
So, we started with 1674 and after filtering out indivs with fewer than 75 genotyped loci, we were left with 1436 individuals. Those are the ones that we will run through rubias to identify to species.
# read in genotypes identified to species using rubias
spp <- read_csv("../data/reported_haplotype_SebSppID_11102017.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
X1 = col_integer(),
group = col_character(),
locus = col_character(),
indiv.ID = col_character(),
haplotype.1 = col_character(),
haplotype.2 = col_character(),
read.depth.1 = col_integer(),
read.depth.2 = col_integer(),
potential.issue = col_integer(),
obs.freq = col_double(),
expected.freq = col_double()
)
select_spp <- spp %>%
  select(group, locus, indiv.ID, haplotype.1, haplotype.2)
spp.id <- select_spp %>%
  gather("gene_copy", "allele", 4:5) %>%
  mutate(gene_copy = ifelse(gene_copy == "haplotype.1", 1, 2))
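The `gather()` call stacks the two haplotype columns into one long `allele` column, and the `ifelse()` recodes the old column names as gene copies 1 and 2. A one-fish toy row (names and haplotypes invented) shows the reshape:

```r
library(dplyr)
library(tidyr)

wide <- tibble(
  group = "SEB", locus = "loc_1", indiv.ID = "fish_01",
  haplotype.1 = "ACG", haplotype.2 = "ATG"
)

long <- wide %>%
  gather("gene_copy", "allele", 4:5) %>%
  mutate(gene_copy = ifelse(gene_copy == "haplotype.1", 1, 2))
long   # two rows: gene_copy 1 -> "ACG", gene_copy 2 -> "ATG"
```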
# only keep the loci that are not the 6 removed from the previous dataset
spp.id_loc <- spp.id %>%
  anti_join(., to_remove, by = "locus")
# that should leave 90 loci
spp.id_loc1 <- spp.id_loc %>%
  mutate(sample_type = "reference")
x <- spp.id_loc1 %>%
  mutate(repunit = group)
# reorder the columns by name rather than position
spp.id1 <- x %>%
  select(sample_type, repunit, group, indiv.ID, locus, gene_copy, allele)
spp.id2 <- spp.id1 %>%
  rename(collection = group, indiv = indiv.ID)
# create an artificial ID for the spp.id
#spp.id2 <- spp.id1 %>%
# unite(NMFS_DNA_ID, 1:2)
# get the data frames into the same format
no_hi_missers2 <- no_hi_missers %>%
  dplyr::select(NMFS_DNA_ID, locus, gene_copy, allele) %>%
  rename(indiv = NMFS_DNA_ID) %>%
  mutate(sample_type = "mixture",
         repunit = NA,
         collection = "osu_samples")
# reorder
no_hi_missers2 <- no_hi_missers2[, c(5:7, 1:4)]
# combine the data into a single df
alleles <- bind_rows(spp.id2, no_hi_missers2)
alleles
We are going to do this by turning the alleles into integers, spreading the data into one column per locus and gene copy, and then getting it into the right format to run rubias.
# first make integers of the alleles
alle_idxs <- alleles %>%
  #dplyr::select(NMFS_DNA_ID, locus, gene_copy, allele) %>%
  group_by(locus) %>%
  mutate(alleidx = as.integer(factor(allele, levels = unique(allele)))) %>%
  ungroup() %>%
  arrange(indiv, locus, alleidx) #%>%
  #mutate(alleidx = ifelse(is.na(alleidx), 0, alleidx)) # do not do this for rubias - leave as NA's
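The `factor()` trick assigns integers per locus in order of first appearance (so the codes are only comparable within a locus), and missing alleles stay `NA`. On a toy allele vector for one locus (haplotypes invented):

```r
# per-locus integer recoding, illustrated on one locus's alleles
alleles_at_locus <- c("ACG", "ATG", "ACG", NA, "GCG")
idx <- as.integer(factor(alleles_at_locus, levels = unique(alleles_at_locus)))
idx   # 1 2 1 NA 3
```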
# drop the original allele column, then spread the integer alleles
alle_idx2 <- alle_idxs %>%
  select(-allele)
two_col <- alle_idx2 %>%
  unite(loc, locus, gene_copy, sep = ".") %>%
  spread(loc, alleidx)
two_col
# write this out to a file that can be read into other software
#two_col %>%
# write_csv("csv_outputs/genos_two_col.csv")
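The unite/spread step can be seen on a two-locus toy fish (IDs, locus names, and allele indexes invented): each locus–gene-copy pair becomes its own column, leaving one row per individual.

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  indiv     = "fish_01",
  locus     = c("loc_1", "loc_1", "loc_2", "loc_2"),
  gene_copy = c(1, 2, 1, 2),
  alleidx   = c(1L, 2L, 1L, 1L)
)

one_row <- toy %>%
  unite(loc, locus, gene_copy, sep = ".") %>%
  spread(loc, alleidx)
one_row   # columns: indiv, loc_1.1, loc_1.2, loc_2.1, loc_2.2
```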
Okay, after some reading, it looks like I need to use infer_mixture in rubias, which requires two separate data frames: one with the reference genotypes and one with the mixture.
I’ll split the data frame that I created, but it needed to be bunged together for the conversion of alleles to integers.
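The split falls straight out of the `sample_type` column. A sketch on a toy stand-in for `two_col` (the object names `sebastes_ref`/`sebastes_mix` and all values here are invented; the `infer_mixture()` arguments follow the rubias documentation, with `gen_start_col` pointing at the first genotype column):

```r
library(dplyr)

# toy stand-in for two_col: sample_type marks reference vs mixture fish
toy <- tibble(
  sample_type = c("reference", "reference", "mixture"),
  repunit     = c("Smel", "Scau", NA),
  collection  = c("Smel", "Scau", "osu_samples"),
  indiv       = c("ref_01", "ref_02", "osu_01"),
  loc_1.1     = c(1L, 2L, 1L),
  loc_1.2     = c(1L, 2L, 2L)
)

sebastes_ref <- toy %>% filter(sample_type == "reference")
sebastes_mix <- toy %>% filter(sample_type == "mixture")

# the actual call would then look something like:
# mix_assign <- rubias::infer_mixture(reference = sebastes_ref,
#                                     mixture = sebastes_mix,
#                                     gen_start_col = 5)
```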
The posterior means of group membership in each collection are in the PofZ column; there are 1,220 individuals with a PofZ of 1.
Maybe some of these samples just had a PofZ a bit below 1? (Eventually I will update brit_data with all of the appropriate meta data for this project.)
Using Hayley’s threshold of 0.95 for the PofZ, we retain 1,436 individuals.
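One way the thresholded set (the `kept_assignments` used below) might be built from the `indiv_posteriors` output: take each fish's top collection, then keep only fish whose top PofZ clears 0.95. The tibble here is an invented stand-in; `indiv`, `collection`, and `PofZ` are real columns of rubias's `indiv_posteriors`.

```r
library(dplyr)

# toy stand-in for mix_assign$indiv_posteriors (values invented)
indiv_posteriors <- tibble(
  indiv      = c("osu_01", "osu_01", "osu_02"),
  collection = c("Smel", "Scau", "Smel"),
  PofZ       = c(0.99, 0.01, 0.80)
)

kept_assignments <- indiv_posteriors %>%
  group_by(indiv) %>%
  slice(which.max(PofZ)) %>%   # top collection per fish
  ungroup() %>%
  filter(PofZ >= 0.95)         # Hayley's 0.95 threshold
kept_assignments                # only osu_01 is retained
```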
Now take a look at which fish from Brittany’s samples are included:
That didn’t increase the number of samples retained by much…
Are there really 375 samples that don’t have genotypes?
all_assign <- mix_assign$indiv_posteriors %>%
  group_by(indiv)
# which of brittany's samples don't have genotype info?
brit_data %>%
  anti_join(., all_assign, by = c("NMFS_DNA_ID" = "indiv"))
Okay, those 375 samples just aren’t in the data set. They’re not just filtered out by the 0.95 PofZ assignment threshold.
To compare Brittany’s assignments with rubias, I would need to separate the string of the repunit to get the species alone without the “S” prefix.
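Since stringr is already loaded, dropping a leading "S" is a one-liner with an anchored pattern (the example repunit codes below are hypothetical):

```r
library(stringr)

# repunit codes are "S" + species; these example values are invented
repunit <- c("Smelanops", "Scaurinus")
species <- str_replace(repunit, "^S", "")   # anchored: only strip a leading S
species
```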
as.character(batch_4792$SAMPLE_ID)
  [1] "4"   "5"   "6"   "8"   "9"   "10"  "11"  "12"  "15"  "16"  "17"
[... output truncated: SAMPLE_IDs continue through "754" ...]
Warning message:
Unknown or uninitialised column: 'str'.
The meta file should contain all the samples for Brittany’s project.
192 samples. That’s still an unfortunate lot. I wonder if they are the same samples that didn’t genotype well last time?
Here’s what Cassie said was in that run: This run has box R373 and two mixed plates (1 of those mixed plates is partial). Plate 1 has the samples from plate R374 that need species ID – Brittany’s samples in 1A - 3A; samples for Helen Killeen in 3B - 4A; two re-picks from your NFS project in 4B and 4C; and ONE tentative Mexican rockfish in 4D. It also has re-runs for Brittany’s project. The two NSF re-picks are highlighted in the attached list of black and yellow sibs that you gave me. We re-extracted these since there was tissue remaining and we also re-ran the original DNA extracts in this run. Plate 2rr are all re-runs including your NSF black and yellow sibs and the rest of the failed samples from Bailey’s run for Brittany’s project.
gtseq63 %>%
  filter(ssBOX_ID == "R374")
# which samples from gtseq63 don't have kept assignments?
gtseq63 %>%
  anti_join(., kept_assignments, by = c("NMFS_DNA_ID" = "indiv")) %>%
  filter(ssBOX_ID == "R133")
None of the 23 samples from R373 were re-runs. The 4 samples that failed from R374 were for Helen Killeen.
47 - 27 = 20 samples that failed from the re-runs? Was that out of 108 total?